Add pybindings for multimodal LLM runner #14285
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14285
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 3 Pending, 1 Unrelated Failure
As of commit f8ace7d with merge base c9f46e2:
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
extension/llm/runner/__init__.py
Outdated
        ValueError: If the image format is not supported
        FileNotFoundError: If the image file doesn't exist
    """
    if isinstance(image, (str, Path)):
shouldn't you use the CV preprocessing utils function?
Yeah let me fix. Recent updates made sure it works with Gemma3, exported using optimum-et.
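For readers following this thread, the path branch of the image input handling and the two documented errors can be sketched roughly as below. This is a minimal standalone illustration, not the code from the PR: `resolve_image_path` and `SUPPORTED_SUFFIXES` are hypothetical names, the list of accepted extensions is an assumption, and the actual decoding (which the reviewer suggests delegating to the existing preprocessing utilities) is left out.

```python
from pathlib import Path
from typing import Union

# Assumed set of extensions; the PR's real list is not shown in this thread.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def resolve_image_path(image: Union[str, Path]) -> Path:
    """Hypothetical helper: validate a file-path image input.

    Raises:
        TypeError: If the input is neither str nor Path
        FileNotFoundError: If the image file doesn't exist
        ValueError: If the image format is not supported
    """
    if not isinstance(image, (str, Path)):
        raise TypeError(f"expected str or Path, got {type(image).__name__}")
    path = Path(image)
    # Check existence first so a typo in the path is reported as such.
    if not path.is_file():
        raise FileNotFoundError(f"Image file not found: {path}")
    # Compare suffixes case-insensitively (".PNG" and ".png" are the same format).
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported image format: {path.suffix or '(none)'}")
    return path
```

Keeping validation separate from decoding means the decoding step can be swapped for a shared preprocessing function without touching the error contract.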
d352449 to 0ad3c71
a23bee6 to f68cc69
f68cc69 to 15a7e6a
09cccec to 0a135d1
.def("is_audio", &MultimodalInput::is_audio)
.def("is_raw_audio", &MultimodalInput::is_raw_audio)
.def(
    "get_text",
Not totally convinced all these getter impl are correct
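One way to make the concern about getter correctness concrete is to state the contract each getter should satisfy and check it mechanically. The sketch below is a pure-Python stand-in, not the real binding: `Kind`, `InputSketch`, and `check_getter_contract` are hypothetical names, and it assumes a getter raises `TypeError` when called on the wrong input type. The actual bindings may behave differently (for example by returning `None`), in which case the check would need adjusting.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class Kind(Enum):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    RAW_AUDIO = auto()


@dataclass(frozen=True)
class InputSketch:
    """Pure-Python stand-in for a tagged multimodal input (not the real binding)."""

    kind: Kind
    payload: Any

    def is_text(self) -> bool:
        return self.kind is Kind.TEXT

    def is_audio(self) -> bool:
        return self.kind is Kind.AUDIO

    def is_raw_audio(self) -> bool:
        return self.kind is Kind.RAW_AUDIO

    def get_text(self) -> str:
        # Assumption: a mismatched getter raises instead of returning garbage.
        if not self.is_text():
            raise TypeError(f"input holds {self.kind.name}, not TEXT")
        return self.payload


def check_getter_contract(inp: Any, predicate: str, getter: str) -> bool:
    """Return True when the getter succeeds exactly when its predicate is true."""
    holds = getattr(inp, predicate)()
    try:
        getattr(inp, getter)()
        succeeded = True
    except TypeError:
        succeeded = False
    return holds == succeeded
```

Running `check_getter_contract` over one input of each kind, for each predicate/getter pair, would give a small table of passes and failures to attach to the review.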
80b7130 to 9d5844a
print(f"Image: {image.width}x{image.height}x{image.channels}")

# Check input types safely
if text_input.is_text():
Would a user ever need to do this?
    """Reset the conversation state"""
    self.runner.reset()

# Usage
Should this section just be a demo.py?
Yeah would be good to have a notebook. But I'll leave it here for now
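If the usage section were split out as suggested, a demo.py could look something like the sketch below. Everything here is illustrative: `RunnerLike`, `EchoRunner`, and `ChatSession` are hypothetical names, and the `generate(prompt, on_token)` signature is an assumption made so the file runs without a model; only the `reset()` call and its docstring mirror the snippet under review.

```python
from typing import Callable, List, Protocol


class RunnerLike(Protocol):
    """Assumed minimal surface; the real runner's signatures may differ."""

    def generate(self, prompt: str, on_token: Callable[[str], None]) -> None: ...

    def reset(self) -> None: ...


class EchoRunner:
    """Stub so the demo runs without a model: streams the prompt back word by word."""

    def __init__(self) -> None:
        self.turns = 0

    def generate(self, prompt: str, on_token: Callable[[str], None]) -> None:
        self.turns += 1
        for word in prompt.split():
            on_token(word + " ")

    def reset(self) -> None:
        self.turns = 0


class ChatSession:
    """Thin wrapper that collects streamed tokens into one reply per turn."""

    def __init__(self, runner: RunnerLike) -> None:
        self.runner = runner
        self.history: List[str] = []

    def send(self, message: str) -> str:
        pieces: List[str] = []
        # The callback receives each token as it is produced.
        self.runner.generate(message, pieces.append)
        reply = "".join(pieces).strip()
        self.history.append(reply)
        return reply

    def reset(self) -> None:
        """Reset the conversation state"""
        self.runner.reset()
        self.history.clear()


if __name__ == "__main__":
    session = ChatSession(EchoRunner())
    print(session.send("hello from demo.py"))
    session.reset()
```

Swapping `EchoRunner` for the real runner would only require an adapter that matches the assumed `generate` and `reset` surface.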
This pull request introduces Python bindings for the ExecuTorch MultimodalRunner, enabling Python users to run multimodal LLM inference (supporting text, image, and audio inputs) and generate text outputs. The changes include new build system integration, a detailed implementation plan and documentation, and a high-level Python API with robust input handling and error management.
Python Bindings Implementation:
A Python API in __init__.py for the MultimodalRunner, providing user-friendly methods for text and image input creation, text generation (with or without streaming callbacks), and resource management. The API includes comprehensive input validation, support for multiple image formats (file path, NumPy array, PIL), and fallback mechanisms if dependencies are missing.
Build System Integration:
CMakeLists.txt changes add a pybind11-based Python extension module (_llm_runner) when EXECUTORCH_BUILD_PYBIND is set, linking all necessary dependencies and setting up include paths.
Documentation and Planning:
README.md.
Utility and Extensibility:
Utility functions (load_image_from_file, preprocess_image, create_generation_config) for easier input preprocessing and configuration from Python.
Testing and Examples (Planned):
test_runner_pybindings.py.
Code Snippet of How to Use:
Output from console:
cc @mergennachin @cccclai @helunwencser @jackzhxng
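The description mentions fallback mechanisms when dependencies are missing. A common way to do this in Python is to probe for optional packages without importing them and pick the first one available. The sketch below shows that general pattern with the standard library only; `has_module`, `pick_image_backend`, and `require_image_backend` are hypothetical names and are not taken from the PR.

```python
import importlib.util
from typing import Optional, Sequence


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_image_backend(preferred: Sequence[str] = ("PIL", "numpy")) -> Optional[str]:
    """Return the first importable module name from `preferred`, or None."""
    for name in preferred:
        if has_module(name):
            return name
    return None


def require_image_backend(preferred: Sequence[str] = ("PIL", "numpy")) -> str:
    """Like pick_image_backend, but fail with an actionable message."""
    backend = pick_image_backend(preferred)
    if backend is None:
        raise ImportError(
            "No image backend available; install one of: " + ", ".join(preferred)
        )
    return backend
```

Probing with `find_spec` keeps import time low and lets the error message name exactly which packages would unlock the feature.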